Web Crawling as an AI Project

Author

  • Christopher H. Brooks
Abstract

This paper argues for the introduction of real-world programming projects into AI curricula, specifically using Python as an implementation language. We describe a modular set of projects centered around a focused web crawler, along with potential extensions. The author’s experiences using this project in a class of undergraduates and Master’s students are also discussed.

Similar resources

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, downloading domain-specific web pages is not a simple task. An unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...

ARCOMEM Crawling Architecture

The World Wide Web is the largest information repository available today. However, this information is very volatile and Web archiving is essential to preserve it for the future. Existing approaches to Web archiving are based on simple definitions of the scope of Web pages to crawl and are limited to basic interactions with Web servers. The aim of the ARCOMEM project is to overcome these limita...

RIDIRE-CPI: an Open Source Crawling and Processing Infrastructure for Web Corpora Building

This paper introduces the RIDIRE-CPI, an open source tool for the building of web corpora with a specific design through a targeted crawling strategy. The tool has been developed within the RIDIRE Project, which aims at creating a 2 billion word balanced web corpus for Italian. RIDIRE-CPI architecture integrates existing open source tools as well as modules developed specifically within the RID...

Collaborative Web Crawler over High-speed Research Network

This paper proposes an idea for constructing a distributed web crawler by utilizing existing high-speed research networks. This is an initial effort of the Web Language Engineering (WLE) project which investigates techniques in processing the languages found in published web documents. In this paper, we focus on designing a geographically distributed web crawler. Multiple crawlers work collabor...

Augmenting Focused Crawling Using Search Engine Queries

The pervasiveness of the Internet makes it an ideal medium for sharing scholarly information. Nowadays, many authors post their publications online so that others may easily access them, increasing the author’s impact in his/her research area. In this project, we develop a focused crawler to find publication pages, web pages that link to online, freely available scholarly publications. In c...

Publication date: 2008